Predicting Airbnb Listing Prices

Wasay Hayat, Juhi Grover, Qi Xu, Xuanye Zheng (Allen)

STAT 301 Group 40 Final Report

April 16th, 2025

Introduction

Airbnb is an American vacation rental company that operates an online marketplace for homestays in over 240 countries and regions worldwide (1). The rise of short-term rental platforms like Airbnb has transformed the travel accommodation landscape, offering travelers a wide range of lodging options with varying features and price points. Understanding the factors that influence the pricing of these listings can be helpful both for hosts aiming to optimize their revenues and for travelers seeking cost-effective choices. This report investigates how various listing features, locations, and times (weekday vs. weekend) can be used to predict the total price of a two-night stay for two people in popular European cities: Barcelona, Paris, and Vienna.

Our overarching question: How can listing features, location, and day type be used to predict the full price of a two-night stay for two people at an Airbnb listing in Barcelona, Paris, or Vienna?

Methods and Results

Data

To answer our question, we use an observational dataset on Airbnb prices in European cities, downloaded from Kaggle (2).

Combining data from the 3 cities (Barcelona, Paris, and Vienna), we have 13,058 observations and 21 variables, including 2 columns created for this project. The combined tibble also carries an unnamed row-index column from the source files (read in as ...1), for 22 columns in total.

Variable Name                Type                  Description
realSum                      Numerical             Full price of accommodation for two people and two nights (response variable)
room_type                    Categorical           Type of accommodation (Private room, Shared room, or Entire home/apt)
room_shared, room_private    Categorical (Binary)  Dummy variables for shared and private rooms
multi                        Categorical (Binary)  Dummy variable if the listing belongs to a host with 2-4 offers
biz                          Categorical (Binary)  Dummy variable if the listing belongs to a host with more than 4 offers
person_capacity              Numerical             Maximum number of guests
host_is_superhost            Categorical (Binary)  Dummy variable for superhost status
cleanliness_rating           Numerical             Rating of cleanliness (1-10)
guest_satisfaction_overall   Numerical             Overall guest satisfaction score
bedrooms                     Numerical             Number of bedrooms (0 for studios)
dist, metro_dist             Numerical             Distance from the city centre and from the nearest metro station, in km
attr_index, attr_index_norm  Numerical             Attraction index of the listing location, and its normalized version (0-100)
rest_index, rest_index_norm  Numerical             Restaurant index of the listing location, and its normalized version (0-100)
lng, lat                     Numerical             Longitude and latitude of the listing location
city (added)                 Categorical           City of the listing location (Barcelona, Paris, or Vienna)
day_type (added)             Categorical           Weekday or weekend
In [1]:
# Load the required libraries

library(tidyverse)
library(repr)
library(tidymodels)
library(glmnet)
library(patchwork)
library(caret)
library(gridExtra)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

✔ broom        1.0.6     ✔ rsample      1.2.1
✔ dials        1.2.1     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
✔ recipes      1.1.0     

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org

Loading required package: Matrix


Attaching package: ‘Matrix’


The following objects are masked from ‘package:tidyr’:

    expand, pack, unpack


Loaded glmnet 4.1-8

Loading required package: lattice


Attaching package: ‘caret’


The following objects are masked from ‘package:yardstick’:

    precision, recall, sensitivity, specificity


The following object is masked from ‘package:purrr’:

    lift



Attaching package: ‘gridExtra’


The following object is masked from ‘package:dplyr’:

    combine


In [2]:
# Copied from Wasay's Assignment 2
# Read the data, add columns for city and weekday/weekend, and combine the six datasets into one

barcelona_weekdays <- read_csv("https://raw.githubusercontent.com/awhayat/stat-301-project/refs/heads/main/barcelona_weekdays.csv") |>
    mutate(city = "Barcelona", day_type = "weekday")
barcelona_weekends <- read_csv("https://raw.githubusercontent.com/awhayat/stat-301-project/refs/heads/main/barcelona_weekends.csv") |>
    mutate(city = "Barcelona", day_type = "weekend")
paris_weekdays <- read_csv("https://raw.githubusercontent.com/awhayat/stat-301-project/refs/heads/main/paris_weekdays.csv") |>
    mutate(city = "Paris", day_type = "weekday")
paris_weekends <- read_csv("https://raw.githubusercontent.com/awhayat/stat-301-project/refs/heads/main/paris_weekends.csv") |>
    mutate(city = "Paris", day_type = "weekend")
vienna_weekdays <- read_csv("https://raw.githubusercontent.com/awhayat/stat-301-project/refs/heads/main/vienna_weekdays.csv") |>
    mutate(city = "Vienna", day_type = "weekday")
vienna_weekends <- read_csv("https://raw.githubusercontent.com/awhayat/stat-301-project/refs/heads/main/vienna_weekends.csv") |>
    mutate(city = "Vienna", day_type = "weekend")

airbnb_data <- bind_rows(paris_weekdays, paris_weekends, barcelona_weekdays, barcelona_weekends, vienna_weekdays, vienna_weekends)
head(airbnb_data)
New names:
• `` -> `...1`
Rows: 1555 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 1278 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 3130 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 3558 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 1738 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 1799 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
A tibble: 6 × 22
...1   realSum   room_type        room_shared  room_private  person_capacity  host_is_superhost  multi  biz    cleanliness_rating  ⋯  dist       metro_dist  attr_index  attr_index_norm  rest_index  rest_index_norm  lng      lat       city   day_type
<dbl>  <dbl>     <chr>            <lgl>        <lgl>         <dbl>            <lgl>              <dbl>  <dbl>  <dbl>               ⋯  <dbl>      <dbl>       <dbl>       <dbl>            <dbl>       <dbl>            <dbl>    <dbl>     <chr>  <chr>
0      296.1599  Private room     FALSE        TRUE          2                TRUE               0      0      10                  ⋯  0.6998206  0.1937094   518.4789    25.23938         1218.6622   71.60803         2.35385  48.86282  Paris  weekday
1      288.2375  Private room     FALSE        TRUE          2                TRUE               0      0      10                  ⋯  2.1000054  0.1072207   873.2170    42.50791         1000.5433   58.79146         2.32436  48.85902  Paris  weekday
2      211.3431  Private room     FALSE        TRUE          2                FALSE              0      0      10                  ⋯  3.3023251  0.2347238   444.5561    21.64084         902.8545    53.05131         2.31714  48.87475  Paris  weekday
3      298.9561  Entire home/apt  FALSE        FALSE         2                FALSE              0      1      9                   ⋯  0.5475667  0.1959965   542.1420    26.39129         1199.1842   70.46351         2.35600  48.86100  Paris  weekday
4      247.9262  Entire home/apt  FALSE        FALSE         4                FALSE              0      0      7                   ⋯  1.1979209  0.1035729   406.9290    19.80916         1070.7755   62.91827         2.35915  48.86648  Paris  weekday
5      527.0761  Entire home/apt  FALSE        FALSE         4                TRUE               0      0      10                  ⋯  1.5432015  0.5491303   967.4781    47.09651         1095.8704   64.39284         2.33201  48.85891  Paris  weekday
In [3]:
# Confirm that all of the data has been combined
n_rows_total = (nrow(barcelona_weekdays) + nrow(barcelona_weekends)
                + nrow(paris_weekdays) + nrow(paris_weekends)
                + nrow(vienna_weekdays) + nrow(vienna_weekends))

n_rows_total
nrow(airbnb_data)
13058
13058
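
As a quick cross-check, the total can also be recomputed from the row counts reported in the read_csv() messages above. This is a minimal sketch: the mapping of counts to files assumes the messages appear in the same order as the read_csv() calls, and file_rows is a name introduced here for illustration.

```r
# Row counts taken from the read_csv() messages, assumed to be in call order
file_rows <- c(
  barcelona_weekdays = 1555,
  barcelona_weekends = 1278,
  paris_weekdays     = 3130,
  paris_weekends     = 3558,
  vienna_weekdays    = 1738,
  vienna_weekends    = 1799
)

# The combined data should have exactly this many rows
total_rows <- sum(file_rows)
total_rows
```

The result should match the value of nrow(airbnb_data) printed above.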

Exploratory Data Analysis

Change variable types as needed

In [4]:
# Convert City, Room Type and Day Type into factors
airbnb_data$city <- as.factor(airbnb_data$city)
airbnb_data$day_type <- as.factor(airbnb_data$day_type)
airbnb_data$room_type <- as.factor(airbnb_data$room_type)

Handling negative values, missing values, and outliers

In [5]:
# Drop NAs or blank values
airbnb_data <- drop_na(airbnb_data)

# Remove negative values for realSum, if any
airbnb_data <- airbnb_data[airbnb_data$realSum >= 0, ]

# Identifying outliers with z-scores. An observation with a z-score greater than 3 or less than -3 can be considered an outlier.
z_scores <- (airbnb_data$realSum - mean(airbnb_data$realSum)) / sd(airbnb_data$realSum)

outliers_index <- abs(z_scores) > 3

# Remove Outliers
airbnb_data <- airbnb_data[!outliers_index, ]

# Remove More Values Based on the IQR 
Q1 <- quantile(airbnb_data$realSum, 0.25)
Q3 <- quantile(airbnb_data$realSum, 0.75)
IQR_value <- Q3 - Q1

# Calculate Upper and Lower Bounds using IQR
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value

# Remove Values Outside Bounds
filtered_data <- airbnb_data[airbnb_data$realSum >= lower_bound & airbnb_data$realSum <= upper_bound, ]
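
To see how the two filters in the cell above interact, here is a minimal sketch on a toy price vector. The helper remove_price_outliers() and the toy values are hypothetical and are not part of the original analysis; the helper applies the same z-score rule followed by the same 1.5 * IQR rule.

```r
# Hypothetical helper: apply the z-score rule, then the 1.5 * IQR rule
remove_price_outliers <- function(x, z_cutoff = 3, iqr_mult = 1.5) {
  # Step 1: drop values more than z_cutoff standard deviations from the mean
  z <- (x - mean(x)) / sd(x)
  x <- x[abs(z) <= z_cutoff]

  # Step 2: drop values outside the IQR fences of what remains
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- iqr_mult * (q[2] - q[1])
  x[x >= q[1] - fence & x <= q[2] + fence]
}

# Toy prices: 20 ordinary values plus one moderate and one extreme value
toy_prices <- c(seq(100, 290, by = 10), 600, 5000)
kept <- remove_price_outliers(toy_prices)
range(kept)
```

On this toy vector the z-score step removes only the value 5000, because that extreme value inflates the standard deviation enough to hide 600. The IQR step then removes 600. This is one reason a second, more robust filter can still find values after the first pass.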

Explore potential variables

In [6]:
# Copied from Juhi's Assignment 2

price_vs_bedrooms <- ggplot(filtered_data, aes(x = bedrooms, y = realSum)) +
  geom_point(aes(color = city)) +
  geom_smooth(method = "lm", se = FALSE, aes(color = city)) +
  facet_wrap(~ city) +  # Facet by city
  labs(title = "Price vs Number of Bedrooms", x = "Number of Bedrooms", y = "Price") +
  theme(
    legend.position = "top",
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

price_density <- ggplot(filtered_data, aes(x = realSum, fill = city)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density of Prices by City", x = "Price", y = "Density") +
  theme_light() +
  theme(
    legend.position = "top",
    plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
    axis.title = element_text(size = 12),
    axis.text = element_text(size = 10)
  )

options(repr.plot.width = 11, repr.plot.height = 5)

grid.arrange(price_density, price_vs_bedrooms, ncol = 2)
`geom_smooth()` using formula = 'y ~ x'
[Figure: "Density of Prices by City" (left) and "Price vs Number of Bedrooms", faceted by city (right)]

In the first plot, we see the density of prices by city. Most prices in all three cities fall below 500 euros. The most common price is about 150 euros in Barcelona, about 180 euros in Vienna, and about 260 euros in Paris, so prices in Paris appear higher overall. All three distributions are right-skewed, and Paris has the heaviest right tail, followed by Barcelona and Vienna. Most listings are therefore clustered at lower price points, with a small number of premium or luxury listings at much higher prices, and Paris has the largest share of these. Because this pattern remains after removing outliers, the higher-priced listings appear to be typical of each city rather than anomalies.

In the second plot, we see how price changes with the number of bedrooms in each of the three cities. Price tends to increase with the number of bedrooms in all three. By inspection, the fitted line is steepest for Barcelona and flattest for Vienna, so an additional bedroom is associated with the largest price increase in Barcelona and the smallest in Vienna.

In [7]:
summary_table <- filtered_data %>%
  group_by(city, day_type) %>%
  summarise(
    n = n(),
    avg_price = round(mean(realSum, na.rm = TRUE), 2),
    median_price = round(median(realSum, na.rm = TRUE), 2),
    sd_price = round(sd(realSum, na.rm = TRUE), 2)
  )

print(summary_table)
`summarise()` has grouped output by 'city'. You can override using the
`.groups` argument.
# A tibble: 6 × 6
# Groups:   city [3]
  city      day_type     n avg_price median_price sd_price
  <fct>     <fct>    <int>     <dbl>        <dbl>    <dbl>
1 Barcelona weekday   1468      245.         202.    124. 
2 Barcelona weekend   1210      247.         197.    132. 
3 Paris     weekday   2788      324.         301.    125. 
4 Paris     weekend   3217      326.         300.    125. 
5 Vienna    weekday   1724      222.         204.     89.2
6 Vienna    weekend   1784      231.         210.     96.8

The summary table provides an overview of Airbnb listings across cities and day types. Paris has the highest average rental prices on both weekdays (€324) and weekends (€326), along with the largest number of listings. In contrast, Vienna has the lowest average prices and the smallest variability (SD ≈ €89–97), indicating a more stable pricing pattern. Rental prices also differ little between weekdays and weekends within each city, suggesting that location and city-specific factors may play a larger role in price variation than day type alone.
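
The weekday and weekend comparison can be made concrete with a little arithmetic on the printed table. This is a minimal sketch: the values are copied from the output above, so the averages are rounded as displayed and the differences are approximate. The object names are introduced here for illustration.

```r
# Values copied from the printed summary table (averages rounded as displayed)
summary_printed <- data.frame(
  city      = rep(c("Barcelona", "Paris", "Vienna"), each = 2),
  day_type  = rep(c("weekday", "weekend"), times = 3),
  n         = c(1468, 1210, 2788, 3217, 1724, 1784),
  avg_price = c(245, 247, 324, 326, 222, 231)
)

# Weekend minus weekday average price, per city
weekend <- summary_printed[summary_printed$day_type == "weekend", ]
weekday <- summary_printed[summary_printed$day_type == "weekday", ]
premium <- setNames(weekend$avg_price - weekday$avg_price, weekend$city)
premium

# Number of listings remaining after outlier removal
n_filtered <- sum(summary_printed$n)
n_filtered
```

The weekend gap is roughly 2 euros in Barcelona and Paris and roughly 9 euros in Vienna, which is small next to standard deviations of about 90 to 130 euros.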

Methods: Plan

Overview

Our variable selection process uses the lasso to select the most useful predictors for our analysis. The process has the following steps:

  • We start by splitting the data into training and testing sets (70% training, 30% testing).
  • We then check the normality of the residuals, homoscedasticity, and linearity.
  • We then fit the lasso, using cross-validation to choose the best value of lambda.
  • Once lambda is chosen, we use the fitted model to predict prices for the testing data.
  • Finally, we calculate the test RMSE, MAE, and R^2.
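
For reference, the three metrics in the last step can be written as follows, where y_i is the observed price of test listing i, \hat{y}_i its predicted price, \bar{y} the mean observed test price, and n the number of test listings. The notation is introduced here and is not taken from the original report.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```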

Variable Selection

We selected Lasso regression as the primary method for modeling Airbnb rental prices. Lasso extends traditional linear regression by incorporating an L1 regularization term, which encourages sparsity in the model coefficients. This has two major advantages: it performs automatic variable selection, and it reduces the impact of multicollinearity by shrinking correlated or unimportant predictors toward zero.
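
For a Gaussian response, glmnet fits the lasso (alpha = 1) by minimizing a penalized least-squares objective of the form below. Here \beta_0 is the intercept, x_i is the predictor vector of listing i, p is the number of predictors, and \lambda is the penalty parameter. The notation is introduced here for illustration.

```latex
\min_{\beta_0,\,\beta}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2
\;+\; \lambda \sum_{j=1}^{p}\left|\beta_j\right|
```

Larger values of \lambda put more weight on the L1 term and therefore set more coefficients to exactly zero.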

Given the number of explanatory variables in our dataset, including numerical ratings, binary indicators, and location-based features, Lasso is well-suited for balancing model complexity with interpretability. To limit overfitting, we applied 10-fold cross-validation (the cv.glmnet() default) to choose the penalty parameter (lambda). This approach helps the model generalize to unseen data and aligns with methods taught in class.
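
The mechanics of k-fold cross-validation can be sketched in base R. The example below uses simulated data and an ordinary linear model purely for illustration; it is not the report's analysis, and all object names and values are hypothetical. cv.glmnet() repeats the same kind of held-out scoring for each candidate lambda and reports the lambda with the smallest average error as lambda.min.

```r
set.seed(1)  # arbitrary seed, used only to make this sketch reproducible

# Simulated data (hypothetical): one predictor with a linear signal plus noise
n <- 100
toy <- data.frame(x = runif(n, 0, 10))
toy$y <- 50 + 20 * toy$x + rnorm(n, sd = 15)

# Assign each row to one of k folds of equal size
k <- 10
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_mse <- sapply(seq_len(k), function(f) {
  fit  <- lm(y ~ x, data = toy[fold_id != f, ])
  pred <- predict(fit, newdata = toy[fold_id == f, ])
  mean((toy$y[fold_id == f] - pred)^2)
})

# The cross-validated error is the average over the held-out folds
cv_mse <- mean(fold_mse)
cv_mse
```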

The Lasso model is appropriate for the characteristics of our dataset, which includes a mix of categorical and numerical predictors and a right-skewed continuous response variable (rental price). The presence of potentially irrelevant or correlated predictors makes Lasso a better choice than ordinary least squares, as it can regularize and exclude variables that do not contribute meaningfully to prediction.
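
Why the L1 penalty can exclude a variable entirely is easiest to see in a simplified setting. With a single standardized predictor (or an orthonormal design), the lasso estimate is the least-squares estimate passed through the soft-thresholding operator. The sketch below illustrates that operator only; it is not how the report's model was fitted, and the coefficient values are hypothetical.

```r
# Soft-thresholding: shrink z toward zero by lambda, and return exactly
# zero whenever |z| <= lambda
soft_threshold <- function(z, lambda) {
  sign(z) * pmax(abs(z) - lambda, 0)
}

# Hypothetical unpenalized coefficients (illustrative values only)
ols_coefs <- c(strong = 3, weak = 0.5, none = 0, weak_neg = -0.5, strong_neg = -3)
shrunk <- soft_threshold(ols_coefs, lambda = 1)
shrunk
```

Coefficients smaller in magnitude than lambda become exactly zero, while larger ones are kept but pulled toward zero by lambda.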

Additionally, we checked the assumptions of linear modeling on an ordinary linear model fitted to a random sample of 50 training observations, assessing the normality of residuals through a Q-Q plot and homoscedasticity and linearity via a residual-vs-fitted plot. These diagnostic checks suggest that a linear model is reasonably appropriate.

In [8]:
set.seed(123)

n <- nrow(filtered_data)
train_idx <- sample(1:n, size = 0.7 * n)

# Split data into Testing and Training Sets
train_data <- filtered_data[train_idx, ]
test_data  <- filtered_data[-train_idx, ]

# Check for Normality of residual
# Plot a Normal Q-Q Plot
sampled_data <- train_data %>%
  sample_n(50)
lm_model <- lm(realSum ~ ., data = sampled_data)
lm_residuals <- residuals(lm_model)
lm_fitted <- fitted(lm_model)
qqnorm(lm_residuals)
qqline(lm_residuals, col = "red")

# Check for Homoscedasticity and linearity
# Plot a Residual vs Fitted Plot
plot(lm_fitted, lm_residuals, 
     main = "Residuals vs Fitted", 
     xlab = "Fitted Values", 
     ylab = "Residuals",
     pch = 20, col = "blue")
abline(h = 0, col = "red")
[Figure: Normal Q-Q plot of the residuals]
[Figure: "Residuals vs Fitted" plot]

To evaluate whether the assumptions of linear regression were reasonably met, we examined two diagnostic plots. The Q-Q plot of residuals shows mild deviation from the reference line at both tails, suggesting slight departures from normality, especially in the upper quantiles. However, the residuals are mostly aligned along the diagonal, indicating that the normality assumption is approximately satisfied. The Residuals vs Fitted plot does not exhibit strong funnel shapes or non-linear patterns, suggesting that the assumptions of homoscedasticity and linearity are largely reasonable. Together, these plots support the use of linear modeling techniques on our data.

In [9]:
X <- model.matrix(realSum ~ ., data = train_data)[, -1]
y <- train_data$realSum

# Cross-validate the lasso (alpha = 1); cv.glmnet() uses 10 folds by default
cv.lasso <- cv.glmnet(X, y, alpha = 1)

# Plot the cross-validation error across the lambda path
plot(cv.lasso)

# Choose Best Lambda Value
best_lambda <- cv.lasso$lambda.min
print(paste("Best lambda value:", best_lambda))

# Fit Lasso Model
lasso_model <- glmnet(X, y, alpha = 1, lambda = best_lambda)
print(coef(lasso_model))

# Predict on Testing Data
X_test <- model.matrix(realSum ~ ., data = test_data)[, -1]
y_test <- test_data$realSum

y_pred <- predict(lasso_model, newx = X_test)

# Calculate test-set metrics (RMSE, MAE, R^2)
rmse <- sqrt(mean((y_test - y_pred)^2))
mae <- mean(abs(y_test - y_pred))
r2 <- 1 - sum((y_test - y_pred)^2) / sum((y_test - mean(y_test))^2)

cat("Test RMSE:", round(rmse, 2), "\n")
cat("Test MAE:", round(mae, 2), "\n")
cat("Test R-squared:", round(r2, 4), "\n")
[1] "Best lambda value: 0.0883337683069244"
24 x 1 sparse Matrix of class "dgCMatrix"
                                      s0
(Intercept)                 3.417719e+01
...1                       -4.472194e-03
room_typePrivate room      -6.568265e+01
room_typeShared room       -1.939128e+02
room_sharedTRUE            -3.587096e-10
room_privateTRUE           -2.795785e+00
person_capacity             3.108834e+01
host_is_superhostTRUE       6.634350e+00
multi                       1.553394e+01
biz                         4.508440e+01
cleanliness_rating          1.025988e+01
guest_satisfaction_overall  9.851857e-02
bedrooms                    3.787600e+01
dist                       -1.866126e-01
metro_dist                  .           
attr_index                 -1.975814e-01
attr_index_norm             7.046662e+00
rest_index                  .           
rest_index_norm             1.129251e+00
lng                        -5.860371e+00
lat                        -2.428128e-01
cityParis                   .           
cityVienna                  .           
day_typeweekend             8.590745e+00
Test RMSE: 89.2 
Test MAE: 67.53 
Test R-squared: 0.5138 
[Figure: cross-validation curve produced by plot(cv.lasso)]

To limit overfitting and enhance interpretability, we applied Lasso regression with L1 regularization, which inherently performs variable selection by shrinking less relevant coefficients to zero. We used the cv.glmnet() function to perform 10-fold cross-validation, selecting the value of the penalty parameter λ that minimized the mean squared error on held-out folds (lambda.min). The selected lambda value was approximately 0.088, as printed above and marked in the cross-validation plot.

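This regularization process left most predictors in the model, including room_type, person_capacity, cleanliness_rating, guest_satisfaction_overall, bedrooms, dist, attr_index_norm, rest_index_norm, and day_type. Variables such as host_is_superhost, multi, biz, lng, and lat were also retained, while metro_dist, rest_index, and the city dummies were shrunk to exactly zero and excluded from the model. The coefficient on room_shared is effectively zero (about -3.6e-10). The row-index column ...1 was also passed to the model as a predictor and received a small non-zero coefficient, although it is only a row identifier from the source files.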
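
The set of excluded predictors can be read off the printed coefficient table directly. The sketch below re-enters the printed values by hand, with "." entries entered as 0 and the intercept omitted; lasso_coefs is a name introduced here for illustration. With the fitted object itself, coef(lasso_model) would be the starting point instead.

```r
# Coefficients copied from the printed glmnet output; "." is entered as 0
lasso_coefs <- c(
  "...1"                       = -4.472194e-03,
  "room_typePrivate room"      = -6.568265e+01,
  "room_typeShared room"       = -1.939128e+02,
  "room_sharedTRUE"            = -3.587096e-10,
  "room_privateTRUE"           = -2.795785e+00,
  "person_capacity"            =  3.108834e+01,
  "host_is_superhostTRUE"      =  6.634350e+00,
  "multi"                      =  1.553394e+01,
  "biz"                        =  4.508440e+01,
  "cleanliness_rating"         =  1.025988e+01,
  "guest_satisfaction_overall" =  9.851857e-02,
  "bedrooms"                   =  3.787600e+01,
  "dist"                       = -1.866126e-01,
  "metro_dist"                 =  0,
  "attr_index"                 = -1.975814e-01,
  "attr_index_norm"            =  7.046662e+00,
  "rest_index"                 =  0,
  "rest_index_norm"            =  1.129251e+00,
  "lng"                        = -5.860371e+00,
  "lat"                        = -2.428128e-01,
  "cityParis"                  =  0,
  "cityVienna"                 =  0,
  "day_typeweekend"            =  8.590745e+00
)

# Predictors dropped by the lasso (coefficient exactly zero)
dropped <- names(lasso_coefs)[lasso_coefs == 0]
dropped

# Predictors that are zero or numerically negligible
negligible <- names(lasso_coefs)[abs(lasso_coefs) < 1e-6]
negligible
```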

The final Lasso model achieved a test RMSE of 89.2, MAE of 67.53, and an R-squared of 0.5138, indicating that it explains approximately 51% of the variance in Airbnb rental prices on the test set. Among the retained variables, room_typePrivate room (about -65.7) and room_typeShared room (about -193.9) had large negative coefficients, suggesting that private and shared rooms tend to be priced well below entire homes or apartments, the baseline category. In contrast, being a superhost and having higher cleanliness_rating or guest satisfaction scores were associated with higher prices, aligning with intuitive expectations.
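
The three test metrics can be wrapped in a small helper and checked on values where the answer is easy to verify by hand. This is a minimal sketch; regression_metrics() and the toy vectors are hypothetical and not part of the original notebook.

```r
# Hypothetical helper returning the three test-set metrics used in this report
regression_metrics <- function(actual, predicted) {
  residual <- actual - predicted
  c(
    rmse = sqrt(mean(residual^2)),
    mae  = mean(abs(residual)),
    r2   = 1 - sum(residual^2) / sum((actual - mean(actual))^2)
  )
}

# Toy example: every prediction is off by exactly 10
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 390)
toy_metrics <- regression_metrics(actual, predicted)
toy_metrics
```

Here RMSE and MAE are both 10, and R-squared is 1 - 400 / 50000 = 0.992. Applied to the notebook's objects, a call such as regression_metrics(y_test, as.numeric(y_pred)) should reproduce the figures reported above.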

Notably, day_typeweekend had a positive coefficient (about 8.59), indicating that listings are on average about 8.59 euros more expensive on weekends, holding other factors fixed. The model also highlighted the influence of location-based features like dist (distance to city center) and rest_index_norm (restaurant accessibility), reinforcing the spatial component of rental pricing.

Overall, the model provides both predictive utility and interpretable insights into the key drivers of Airbnb pricing in European cities.

Discussion

Our project focused on predicting Airbnb rental prices using data from Paris, Barcelona, and Vienna. We addressed issues such as missing values, negative prices, and outliers. To reduce multicollinearity and select the most predictive variables, we used Lasso regression, which regularizes the model by penalizing less informative features. This process helped us to identify which features had the greatest influence on the final rental price (realSum). Among those retained by the Lasso model were room_type, person_capacity, host_is_superhost, bedrooms, biz, cleanliness_rating, and various attraction and location-related indices. Notably, variables such as metro_dist, rest_index, and even city were not selected, which suggests either redundancy with other features or minimal marginal contribution to price prediction. Overall, the results were mostly in line with our expectations. Features like room_type, bedrooms, and person_capacity were strong predictors of price, which aligns with typical Airbnb pricing logic.

Our final model had a test RMSE of 89.2, MAE of 67.53, and an R-squared of 0.514. This means that around 51% of the variance in rental prices could be explained by the included predictors. While this is a moderate level of accuracy, it highlights the complexity of Airbnb pricing, which can be influenced by many external or unobserved factors not captured in the dataset, such as seasonality, reviews, or nearby events.

Looking ahead, the model could be improved by creating interaction terms or combining multiple distance-based variables into a unified measure of centrality. One area worth exploring is how seasonality and event-based demand affect Airbnb prices, especially in tourist-heavy cities.
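
As one possible starting point for the interaction terms mentioned above, the design matrix can be expanded with R's formula syntax before it is passed to glmnet. The sketch below uses a toy data frame with hypothetical values and shows only how a bedrooms-by-city interaction would appear as columns. Whether such terms improve prediction is untested here.

```r
# Toy data with hypothetical values, using column names from this report
toy <- data.frame(
  bedrooms = c(1, 2, 3, 1, 2, 3),
  city     = factor(c("Barcelona", "Barcelona", "Paris", "Paris", "Vienna", "Vienna"))
)

# bedrooms * city expands to main effects plus bedrooms-by-city interactions;
# [, -1] drops the intercept column, as in the model fitting cell above
X_inter <- model.matrix(~ bedrooms * city, data = toy)[, -1]
colnames(X_inter)
```

The interaction columns let the bedroom slope differ by city, which is the pattern suggested by the faceted bedroom plot in the exploratory analysis.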

References

  • (1) About us - Airbnb newsroom: https://news.airbnb.com/about-us/
  • (2) Original data source: Gyódi, K. and Ł. Nawaro. Determinants of Airbnb Prices in European Cities: A Spatial Econometrics Approach (Supplementary Material). Zenodo, 13 Jan. 2021, doi:10.5281/zenodo.4446043.